DTSA 5510 Unsupervised Algorithms in Machine Learning Final Project
2023.09.17 D. Ikoma
This project shows how to perform mall customer segmentation using unsupervised machine learning algorithms. This is an unsupervised clustering problem that compares three popular algorithms: KMeans, Hierarchical clustering, and DBSCAN.
# import libraries
import os
import time
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly as py
import plotly.graph_objs as go
import numpy as np
from numpy import unique
from scipy import stats
from scipy.stats import pearsonr
from itertools import product
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
import warnings
warnings.filterwarnings("ignore")
In this section, the raw data are read, overviewed, and checked for any required cleaning (https://www.kaggle.com/datasets/shwetabh123/mall-customers) [1]. The original dataset misnames the 'Gender' column as 'Genre', so we corrected the dataset in advance.
# file size
file = './data/Mall_Customers.csv'
file_size = os.path.getsize(file)
print('file size: ', f"{round(file_size / 1024, 2)} KB")
file size: 3.89 KB
# Read data into a DataFrame
mall_data = pd.read_csv(file)
print('There are {} rows and {} columns in our dataset.\n'.format(mall_data.shape[0], mall_data.shape[1]))
mall_data.info()
There are 200 rows and 5 columns in our dataset.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   CustomerID              200 non-null    int64
 1   Gender                  200 non-null    object
 2   Age                     200 non-null    int64
 3   Annual Income (k$)      200 non-null    int64
 4   Spending Score (1-100)  200 non-null    int64
dtypes: int64(4), object(1)
memory usage: 7.9+ KB
mall_data.head()
|   | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |
mall_data.describe()
|   | CustomerID | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|
| count | 200.000000 | 200.000000 | 200.000000 | 200.000000 |
| mean | 100.500000 | 38.850000 | 60.560000 | 50.200000 |
| std | 57.879185 | 13.969007 | 26.264721 | 25.823522 |
| min | 1.000000 | 18.000000 | 15.000000 | 1.000000 |
| 25% | 50.750000 | 28.750000 | 41.500000 | 34.750000 |
| 50% | 100.500000 | 36.000000 | 61.500000 | 50.000000 |
| 75% | 150.250000 | 49.000000 | 78.000000 | 73.000000 |
| max | 200.000000 | 70.000000 | 137.000000 | 99.000000 |
To summarize the five columns: 'CustomerID' is a row identifier, 'Gender' is the only binary categorical column, and 'Age', 'Annual Income (k$)', and 'Spending Score (1-100)' are numeric. We will exclude the binary 'Gender' column, as it is inappropriate for distance-based clustering.
We will check for null data and perform basic data cleaning.
mall_data.isnull().sum()
CustomerID                0
Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64
There are no null values in any of the 5 columns, so no cleaning is needed to handle missing data.
In this section, we perform EDA, examining the distributions and correlations in the dataset to inform the modeling. First, let's check the age distribution by gender.
males_age = mall_data[mall_data['Gender'] == 'Male']['Age'] # subset with males age
females_age = mall_data[mall_data['Gender'] == 'Female']['Age'] # subset with females age
age_bins = range(15,75,5)
# males histogram
fig2, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,5), sharey=True)
sns.histplot(males_age, bins=age_bins, color='#0066ff', ax=ax1, edgecolor="k", linewidth=2)  # histplot replaces the deprecated distplot
ax1.set_xticks(age_bins)
ax1.set_ylim(top=25)
ax1.set_title('Males')
ax1.set_ylabel('Count')
ax1.text(45, 23, "TOTAL count: {}".format(males_age.count()))
ax1.text(45, 22, "Mean age: {:.1f}".format(males_age.mean()))
# females histogram
sns.histplot(females_age, bins=age_bins, color='#cc66ff', ax=ax2, edgecolor="k", linewidth=2)
ax2.set_xticks(age_bins)
ax2.set_title('Females')
ax2.set_ylabel('Count')
ax2.text(45, 23, "TOTAL count: {}".format(females_age.count()))
ax2.text(45, 22, "Mean age: {:.1f}".format(females_age.mean()))
plt.show()
The total count of females is higher, but the mean age is about the same. For males, the distribution is relatively flat across age, whereas for females, the frequency tends to be higher between ages 20 and 50. Next, we check the annual income distribution by gender.
males_income = mall_data[mall_data['Gender']=='Male']['Annual Income (k$)'] # subset with males income
females_income = mall_data[mall_data['Gender']=='Female']['Annual Income (k$)'] # subset with females income
my_bins = range(10,150,10)
# males histogram
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,5))
sns.histplot(males_income, bins=my_bins, color='#0066ff', ax=ax1, edgecolor="k", linewidth=2)
ax1.set_xticks(my_bins)
ax1.set_yticks(range(0,24,2))
ax1.set_ylim(0,22)
ax1.set_title('Males')
ax1.set_ylabel('Count')
ax1.text(85,19, "Mean income: {:.1f}k$".format(males_income.mean()))
ax1.text(85,18, "Median income: {:.1f}k$".format(males_income.median()))
ax1.text(85,17, "Std. deviation: {:.1f}k$".format(males_income.std()))
# females histogram
sns.histplot(females_income, bins=my_bins, color='#cc66ff', ax=ax2, edgecolor="k", linewidth=2)
ax2.set_xticks(my_bins)
ax2.set_yticks(range(0,24,2))
ax2.set_ylim(0,22)
ax2.set_title('Females')
ax2.set_ylabel('Count')
ax2.text(85,19, "Mean income: {:.1f}k$".format(females_income.mean()))
ax2.text(85,18, "Median income: {:.1f}k$".format(females_income.median()))
ax2.text(85,17, "Std. deviation: {:.1f}k$".format(females_income.std()))
plt.show()
There are no notably large differences in mean, median, or standard deviation between the male and female distributions. The distribution of annual income peaks around 70k$. Finally, we check the distribution of 'Spending Score' for Males and Females.
males_spending = mall_data[mall_data['Gender']=='Male']['Spending Score (1-100)'] # subset with males spending score
females_spending = mall_data[mall_data['Gender']=='Female']['Spending Score (1-100)'] # subset with females spending score
spending_bins = range(0,105,5)
# males histogram
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,5))
sns.histplot(males_spending, bins=spending_bins, color='#0066ff', ax=ax1, edgecolor="k", linewidth=2)
ax1.set_xticks(spending_bins)
ax1.set_xlim(0,100)
ax1.set_yticks(range(0,17,1))
ax1.set_ylim(0,16)
ax1.set_title('Males')
ax1.set_ylabel('Count')
ax1.text(50,15, "Mean spending score: {:.1f}".format(males_spending.mean()))
ax1.text(50,14, "Median spending score: {:.1f}".format(males_spending.median()))
ax1.text(50,13, "Std. deviation score: {:.1f}".format(males_spending.std()))
# females histogram
sns.histplot(females_spending, bins=spending_bins, color='#cc66ff', ax=ax2, edgecolor="k", linewidth=2)
ax2.set_xticks(spending_bins)
ax2.set_xlim(0,100)
ax2.set_yticks(range(0,17,1))
ax2.set_ylim(0,16)
ax2.set_title('Females')
ax2.set_ylabel('Count')
ax2.text(50,15, "Mean spending score: {:.1f}".format(females_spending.mean()))
ax2.text(50,14, "Median spending score: {:.1f}".format(females_spending.median()))
ax2.text(50,13, "Std. deviation score: {:.1f}".format(females_spending.std()))
plt.show()
There doesn't seem to be much difference in the distribution of 'Spending Score' between Males and Females. The mean is slightly larger for Females, with a peak at 40-45.
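The per-gender counts, means, medians, and standard deviations annotated on the histograms above can be computed in a single call with pandas `groupby`. A minimal sketch on a hypothetical miniature of `mall_data` (the notebook would pass the real frame instead):

```python
import pandas as pd

# Hypothetical stand-in with the same column names as mall_data
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "Age": [19, 21, 20, 23, 31],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
})

# Count, mean, median and std of spending score per gender in one call
stats = df.groupby("Gender")["Spending Score (1-100)"].agg(
    ["count", "mean", "median", "std"]
)
print(stats)
```

The same pattern extends to any of the numeric columns by changing the selected column name.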
Next, we will check the correlation. First, we will check the correlation between 'Age' and 'Annual Income' for Males and Females.
# calculating Pearson's correlations
corr1, _ = pearsonr(males_age.values, males_income.values)
corr2, _ = pearsonr(females_age.values, females_income.values)
sns.lmplot(data=mall_data, x='Age', y='Annual Income (k$)', hue='Gender', aspect=1.5)
plt.text(15,87, 'Pearson: {:.2f}'.format(corr1), color='blue')
plt.text(65,80, 'Pearson: {:.2f}'.format(corr2), color='green')
plt.show()
The correlation coefficients for both Males and Females are small, and there is no linear correlation between 'Age' and 'Annual Income'. Next, we will check the correlation between 'Age' and 'Spending Score'.
# calculating Pearson's correlations
corr1, _ = pearsonr(males_age.values, males_spending.values)
corr2, _ = pearsonr(females_age.values, females_spending.values)
sns.lmplot(data=mall_data, x='Age', y='Spending Score (1-100)', hue='Gender', aspect=1.5)
plt.text(65,65, 'Pearson: {:.2f}'.format(corr1), color='blue')
plt.text(13,83, 'Pearson: {:.2f}'.format(corr2), color='green')
plt.show()
There is a negative correlation between 'Age' and 'Spending Score' for both males and females: as 'Age' increases, 'Spending Score' decreases. The coefficient for Females (-0.38) is larger in absolute value than that for Males (-0.28). Finally, we check the correlation between 'Annual Income' and 'Spending Score'.
# calculating Pearson's correlations
corr1, _ = pearsonr(males_income.values, males_spending.values)
corr2, _ = pearsonr(females_income.values, females_spending.values)
sns.lmplot(data=mall_data, x='Annual Income (k$)', y='Spending Score (1-100)', hue='Gender', aspect=1.5)
plt.text(130,23, 'Pearson: {:.2f}'.format(corr1), color='blue')
plt.text(130,77, 'Pearson: {:.2f}'.format(corr2), color='green')
plt.show()
There is no linear correlation for either Males or Females, but nonlinear structure can be seen in the scatter plot. The points appear to form five groups, which will serve as a reference when choosing parameters for the unsupervised models.
We have summarized the distribution and correlation of each feature, 'Age', 'Annual Income', and 'Spending Score', for Males and Females. Overall, there were no differences in trends between males and females, and no outliers. Clusters could also be seen in the scatter plot of 'Annual Income' versus 'Spending Score'.
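The pairwise Pearson coefficients examined above can also be read off a single correlation matrix via `DataFrame.corr`. A sketch on stand-in data with the same column names:

```python
import pandas as pd

# Hypothetical stand-in for the numeric columns of mall_data
df = pd.DataFrame({
    "Age": [19, 21, 20, 23, 31],
    "Annual Income (k$)": [15, 15, 16, 16, 17],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
})

# Pairwise Pearson correlations between all numeric columns
corr = df.corr(method="pearson")
print(corr.round(2))
```

Each off-diagonal entry corresponds to one of the `pearsonr` calls made per pair above.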
The problem in this project is an unsupervised clustering problem. We compare and verify three models: KMeans, hierarchical clustering, and DBSCAN. The binary column 'Gender' is excluded from the analysis because it is not suitable as a feature for distance-based clustering and the EDA results showed no difference in trends between Males and Females.
We will perform feature scaling before modeling. The targets are 3 features: 'Age', 'Annual Income', and 'Spending Score'.
mall_data_scaled = mall_data[["Age","Annual Income (k$)","Spending Score (1-100)"]]
# Class instance
scaler = StandardScaler()
# Fit_transform
mall_data_scaled = scaler.fit_transform(mall_data_scaled)
mall_data_scaled = pd.DataFrame(mall_data_scaled)
mall_data_scaled.columns = ["Age","Annual Income (k$)","Spending Score (1-100)"]
mall_data_scaled.head()
|   | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|
| 0 | -1.424569 | -1.738999 | -0.434801 |
| 1 | -1.281035 | -1.738999 | 1.195704 |
| 2 | -1.352802 | -1.700830 | -1.715913 |
| 3 | -1.137502 | -1.700830 | 1.040418 |
| 4 | -0.563369 | -1.662660 | -0.395980 |
mall_data_scaled.describe()
|   | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|
| count | 2.000000e+02 | 2.000000e+02 | 2.000000e+02 |
| mean | -1.021405e-16 | -2.131628e-16 | -1.465494e-16 |
| std | 1.002509e+00 | 1.002509e+00 | 1.002509e+00 |
| min | -1.496335e+00 | -1.738999e+00 | -1.910021e+00 |
| 25% | -7.248436e-01 | -7.275093e-01 | -5.997931e-01 |
| 50% | -2.045351e-01 | 3.587926e-02 | -7.764312e-03 |
| 75% | 7.284319e-01 | 6.656748e-01 | 8.851316e-01 |
| max | 2.235532e+00 | 2.917671e+00 | 1.894492e+00 |
We applied StandardScaler to each feature, transforming the columns to mean 0 and standard deviation 1. (describe() reports the sample standard deviation, ddof=1, which is why the table shows ≈1.0025, i.e. √(200/199), rather than exactly 1.)
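As a sanity check, StandardScaler's transform is simply z = (x − mean) / std, using the population (ddof=0) standard deviation. A sketch verifying this equivalence on a small stand-in array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small stand-in for two numeric feature columns
X = np.array([[19., 15.], [21., 15.], [20., 16.], [23., 16.], [31., 17.]])

scaled = StandardScaler().fit_transform(X)

# Manual z-score with population std (ddof=0), which StandardScaler uses
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
```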
First, we will apply KMeans clustering. We set the number of clusters k as a hyperparameter and check the elbow plot.
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2, 10))
visualizer.fit(mall_data_scaled)
visualizer.show()
plt.show()
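The elbow heuristic that KElbowVisualizer plots can be reproduced directly from KMeans' `inertia_` (within-cluster sum of squares). A sketch on synthetic blobs rather than the mall data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true clusters
X, _ = make_blobs(n_samples=200, centers=4, random_state=1)

# Within-cluster sum of squares for each candidate k
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
            for k in range(2, 10)}

for k, wcss in inertias.items():
    print(k, round(wcss, 1))
```

Inertia always decreases as k grows; the "elbow" is the k after which further decreases flatten out.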
From the elbow plot above, the elbow can be read at k=4. Next, we check the clusters for k=4 using scatter plots.
# ScatterPlot when K=4
# Record the start time of the process
start_time = time.time()
kmeans = KMeans(n_clusters=4, max_iter=50)
kmeans.fit(mall_data_scaled)
# Record the end time of processing
end_time = time.time()
# Calculate processing time (in seconds)
elapsed_time = end_time - start_time
print(f"processing time: {elapsed_time} sec\n")
mall_data["Label"] = kmeans.labels_
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].set_title("Plotting the data into 4 clusters", fontsize=20)
sns.scatterplot(data=mall_data, x="Age", y="Annual Income (k$)", hue="Label", s=60, palette=['green', 'orange', 'blue', 'red'], ax=axes[0])
axes[1].set_title("Plotting the data into 4 clusters", fontsize=20)
sns.scatterplot(data=mall_data, x="Spending Score (1-100)", y="Annual Income (k$)", hue="Label", s=60, palette=['green', 'orange', 'blue', 'red'], ax=axes[1])
plt.show()
processing time: 0.03966546058654785 sec
In the figure on the left, Label=0 (green) and Label=2 (blue) overlap. In the figure on the right, the orange and red clusters also overlap. The right-hand figure suggests that k=5 may be more appropriate, so let's visualize the case of k=5 as well.
# ScatterPlot when K=5
kmeans_5 = KMeans(n_clusters=5, max_iter=50)
kmeans_5.fit(mall_data_scaled)
mall_data["Label_5"] = kmeans_5.labels_
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].set_title("Plotting the data into 5 clusters", fontsize=20)
sns.scatterplot(data=mall_data, x="Age", y="Annual Income (k$)", hue="Label_5", s=60, palette=['green', 'orange', 'blue', 'red', 'brown'], ax=axes[0])
axes[1].set_title("Plotting the data into 5 clusters", fontsize=20)
sns.scatterplot(data=mall_data, x="Spending Score (1-100)", y="Annual Income (k$)", hue="Label_5", s=60, palette=['green', 'orange', 'blue', 'red', 'brown'], ax=axes[1])
plt.show()
In the figure on the right, the overlap between the blue and brown clusters has not been resolved, so there appears to be no advantage in changing from k=4 to k=5. Next, for k=4, we check the clustering in a 3D plot.
def tracer(db, n, name):
'''
This function returns trace object for Plotly
'''
return go.Scatter3d(
x = db[db['Label']==n]['Age'],
y = db[db['Label']==n]['Spending Score (1-100)'],
z = db[db['Label']==n]['Annual Income (k$)'],
mode = 'markers',
name = name,
marker = dict(
size = 5
)
)
trace0 = tracer(mall_data, 0, 'Label 0')
trace1 = tracer(mall_data, 1, 'Label 1')
trace2 = tracer(mall_data, 2, 'Label 2')
trace3 = tracer(mall_data, 3, 'Label 3')
data = [trace0, trace1, trace2, trace3]
layout = go.Layout(
title = 'Clusters by KMeans',
scene = dict(
xaxis = dict(title = 'Age'),
yaxis = dict(title = 'Spending Score'),
zaxis = dict(title = 'Annual Income')
),
height = 600
)
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)
Checking the 3D plot, we can see that the clusters that appeared to overlap in the 2D scatter plot are clearly separated.
Finally, the characteristics of the class are summarized as follows.
| Cluster (Label) | Age | Annual Income | Spending Score |
|---|---|---|---|
| Cluster 0 | - | high | low |
| Cluster 1 | high | low | low |
| Cluster 2 | low | high | high |
| Cluster 3 | low | low | high |
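The qualitative table above can be derived by averaging each feature per cluster label. A sketch, using a hypothetical labelled frame in place of `mall_data`:

```python
import pandas as pd

# Hypothetical clustered data with the same column names as mall_data
df = pd.DataFrame({
    "Age": [50, 55, 25, 27, 24, 26],
    "Annual Income (k$)": [20, 25, 80, 85, 18, 22],
    "Spending Score (1-100)": [10, 15, 80, 85, 75, 70],
    "Label": [0, 0, 1, 1, 2, 2],
})

# Mean feature value per cluster: the high/low reading of each cell
profile = df.groupby("Label").mean()
print(profile)
```

Comparing each cluster's mean against the overall mean gives the "high"/"low" labels in the table.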
Next, we will perform hierarchical clustering, using scikit-learn's AgglomerativeClustering.
model_agg = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(mall_data_scaled)
def plot_dendrogram(model, **kwargs):
# Create linkage matrix and then plot the dendrogram
# create the counts of samples under each node
counts = np.zeros(model.children_.shape[0])
n_samples = len(model.labels_)
for i, merge in enumerate(model.children_):
current_count = 0
for child_idx in merge:
if child_idx < n_samples:
current_count += 1 # leaf node
else:
current_count += counts[child_idx - n_samples]
counts[i] = current_count
linkage_matrix = np.column_stack(
[model.children_, model.distances_, counts]
).astype(float)
# Plot the corresponding dendrogram
dendrogram(linkage_matrix, **kwargs)
plt.title('Hierarchical clustering')
plot_dendrogram(model_agg, truncate_mode='level', p=4)
plt.xlabel("Number of points in node.")
plt.show()
The cut height of the dendrogram can be regarded as a hyperparameter; similar to KMeans, the data are well separated at the height (inter-cluster distance) that yields k=4.
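Cutting the tree at a height that yields four flat clusters can also be done directly with SciPy's `fcluster` on a Ward linkage (matching AgglomerativeClustering's default). A sketch on synthetic data:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
# Four synthetic clusters around well-separated centres
X = np.vstack([rng.normal(c, 0.3, size=(30, 2))
               for c in [(0, 0), (5, 0), (0, 5), (5, 5)]])

Z = linkage(X, method="ward")
# Cut the tree into exactly 4 flat clusters
labels = fcluster(Z, t=4, criterion="maxclust")
print(len(np.unique(labels)))
```

Using `criterion="distance"` with a height threshold instead of `maxclust` corresponds directly to cutting the dendrogram at a chosen height.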
# ScatterPlot when K=4
# Record the start time of the process
start_time = time.time()
n_clusters = 4
model_agg = AgglomerativeClustering(n_clusters=n_clusters).fit(mall_data_scaled)
# Record the end time of processing
end_time = time.time()
# Calculate processing time (in seconds)
elapsed_time = end_time - start_time
print(f"processing time: {elapsed_time} sec\n")
labels = model_agg.labels_
mall_data["Label_agg"] = labels
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].set_title("Plotting the data into 4 clusters", fontsize=20)
sns.scatterplot(data=mall_data, x="Age", y="Annual Income (k$)", hue="Label_agg", s=60, palette=['green', 'orange', 'blue', 'red'], ax=axes[0])
axes[1].set_title("Plotting the data into 4 clusters", fontsize=20)
sns.scatterplot(data=mall_data, x="Spending Score (1-100)", y="Annual Income (k$)", hue="Label_agg", s=60, palette=['green', 'orange', 'blue', 'red'], ax=axes[1])
plt.show()
processing time: 0.0024602413177490234 sec
Hierarchical clustering also achieved classification similar to KMeans.
As the third model, we apply DBSCAN. Since DBSCAN was not covered in the course, we give a brief overview here. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is one of the clustering algorithms implemented in the scikit-learn library [2]. As the name suggests, its core idea is dense regions: the assumption is that natural clusters consist of densely located points. Defining a "dense region" requires two parameters: Eps (ε, a distance) and MinPts (the minimum number of points within distance Eps). The distance metric can optionally be specified by the user, but Euclidean distance is the usual default (as in scikit-learn). A point with at least MinPts neighbours within Eps is a "core point"; points within Eps of a core point but without enough neighbours of their own are "border points"; the remaining points are noise (outliers).
Pros
- The number of clusters does not need to be specified in advance.
- Clusters of arbitrary shape can be found, and outliers are explicitly flagged as noise.

Cons
- Results are sensitive to the choice of Eps and MinPts.
- Clusters with widely varying densities are hard to separate with a single Eps.
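The Eps/MinPts mechanics can be illustrated on a toy set: points in a dense blob join one cluster, while an isolated point receives the noise label -1. A minimal sketch (not the project's tuned model):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# A tight blob of 10 points plus one far-away outlier
X = np.vstack([np.random.default_rng(0).normal(0, 0.1, size=(10, 2)),
               [[5.0, 5.0]]])

# Every blob point has >= 3 neighbours within eps, so all become one cluster;
# the isolated point has none, so it is labelled -1 (noise)
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)
```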
We will first create a matrix of investigated hyperparameter combinations.
eps_values = np.arange(0.5, 0.7, 0.05) # eps values to be investigated
min_samples = np.arange(5, 12) # min_samples values to be investigated
DBSCAN_params = list(product(eps_values, min_samples))
no_of_clusters = []
sil_score = []
for p in DBSCAN_params:
DBS_clustering = DBSCAN(eps=p[0], min_samples=p[1]).fit(mall_data_scaled)
no_of_clusters.append(len(np.unique(DBS_clustering.labels_)))
sil_score.append(silhouette_score(mall_data_scaled, DBS_clustering.labels_))
The heatmap below shows how many clusters the DBSCAN algorithm generated for each parameter combination.
tmp = pd.DataFrame.from_records(DBSCAN_params, columns =['Eps', 'Min_samples'])
tmp['No_of_clusters'] = no_of_clusters
pivot_1 = pd.pivot_table(tmp, values='No_of_clusters', index='Min_samples', columns='Eps')
fig, ax = plt.subplots(figsize=(10,6))
sns.heatmap(pivot_1, annot=True,annot_kws={"size": 16}, cmap="YlGnBu", ax=ax)
ax.set_title('Number of clusters')
plt.show()
As the heatmap above shows, the number of clusters varies from 3 to 7. To decide which combination to choose, we will use the silhouette score as a metric and again plot it as a heatmap.
tmp = pd.DataFrame.from_records(DBSCAN_params, columns =['Eps', 'Min_samples'])
tmp['Sil_score'] = sil_score
pivot_1 = pd.pivot_table(tmp, values='Sil_score', index='Min_samples', columns='Eps')
fig, ax = plt.subplots(figsize=(12,6))
sns.heatmap(pivot_1, annot=True, annot_kws={"size": 10}, cmap="YlGnBu", ax=ax)
plt.show()
The global maximum is 0.29, at eps=0.6 and min_samples=10, which yields 5 clusters.
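Note that `silhouette_score` above treats the noise label -1 as if it were a cluster. One possible refinement, sketched here on synthetic blobs rather than the mall data, is to score only the non-noise points:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated clusters
X, _ = make_blobs(n_samples=200, centers=[(0, 0), (6, 0), (0, 6), (6, 6)],
                  cluster_std=0.6, random_state=1)
labels = DBSCAN(eps=0.6, min_samples=10).fit_predict(X)

mask = labels != -1  # drop noise points before scoring
score = silhouette_score(X[mask], labels[mask])
print(round(score, 2))
```

Whether to include noise points in the score is a judgment call; excluding them measures only the quality of the clusters DBSCAN actually formed.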
# Record the start time of the process
start_time = time.time()
DBS_clustering = DBSCAN(eps=0.6, min_samples=10).fit(mall_data_scaled)
# Record the end time of processing
end_time = time.time()
# Calculate processing time (in seconds)
elapsed_time = end_time - start_time
print(f"processing time: {elapsed_time} sec")
labels = DBS_clustering.labels_ # append labels to points
mall_data["Label_DBS"] = labels
processing time: 0.0030374526977539062 sec
Finally, we will draw a scatter plot. DBSCAN identifies outliers as Label=-1, so outliers are also displayed in the scatter plot.
outliers = mall_data[mall_data["Label_DBS"]==-1]
fig2, axes = plt.subplots(1,2,figsize=(12,5))
sns.scatterplot(x="Age", y="Annual Income (k$)",
data=mall_data[mall_data["Label_DBS"]!=-1],
hue='Label_DBS', ax=axes[0], palette=['green', 'orange', 'blue', 'red'], legend='full', s=60)
sns.scatterplot(x="Spending Score (1-100)", y="Annual Income (k$)",
data=mall_data[mall_data["Label_DBS"]!=-1],
hue='Label_DBS', ax=axes[1], palette=['green', 'orange', 'blue', 'red'], legend='full', s=60)
axes[0].set_title("Plotting the data into 4 clusters", fontsize=20)
axes[1].set_title("Plotting the data into 4 clusters", fontsize=20)
axes[0].scatter(outliers['Age'], outliers['Annual Income (k$)'], s=5, label='outliers', c="k")
axes[1].scatter(outliers['Spending Score (1-100)'], outliers['Annual Income (k$)'], s=5, label='outliers', c="k")
axes[0].legend()
axes[1].legend()
plt.setp(axes[0].get_legend().get_texts(), fontsize='10')
plt.setp(axes[1].get_legend().get_texts(), fontsize='10')
plt.show()
DBSCAN labels the regions with a low 'Spending Score' as outliers, so in this case the identification was not successful. We tried other hyperparameter settings as well, but could not separate the low-'Spending Score' regions either.
First, let's summarize processing time. (Please note that processing time may vary slightly from trial to trial.)
| model | processing time |
|---|---|
| KMeans | 32.763 msec |
| Hierarchical clustering | 5.064 msec |
| DBSCAN | 4.007 msec |
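Single `time.time()` measurements like those above are noisy; `timeit.repeat` over several repetitions gives a more stable comparison. A sketch on synthetic data of the same shape as the scaled features:

```python
import timeit

import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Stand-in for the 200x3 scaled feature matrix
X = np.random.default_rng(0).normal(size=(200, 3))

for name, fit in [
    ("KMeans", lambda: KMeans(n_clusters=4, n_init=10, random_state=1).fit(X)),
    ("Agglomerative", lambda: AgglomerativeClustering(n_clusters=4).fit(X)),
]:
    # best of 5 repeats of a single fit, reported in milliseconds
    t = min(timeit.repeat(fit, number=1, repeat=5)) * 1000
    print(f"{name}: {t:.2f} ms")
```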
In terms of computational load, hierarchical clustering and DBSCAN are comparably fast, while KMeans takes over six times longer. Next, we compare the clustering quality of the three models using scatter plots.
# KMeans when K=4
fig, axes = plt.subplots(3, 2, figsize=(12, 18))
axes[0, 0].set_title("KMeans clustering", fontsize=20)
sns.scatterplot(data=mall_data, x="Age", y="Annual Income (k$)", hue="Label", s=60, palette=['green', 'orange', 'blue', 'red'], ax=axes[0, 0])
axes[0, 1].set_title("KMeans clustering", fontsize=20)
sns.scatterplot(data=mall_data, x="Spending Score (1-100)", y="Annual Income (k$)", hue="Label", s=60, palette=['green', 'orange', 'blue', 'red'], ax=axes[0, 1])
# Hierarchical clustering when K=4
axes[1, 0].set_title("Hierarchical clustering", fontsize=20)
sns.scatterplot(data=mall_data, x="Age", y="Annual Income (k$)", hue="Label_agg", s=60, palette=['green', 'orange', 'blue', 'red'], ax=axes[1, 0])
axes[1, 1].set_title("Hierarchical clustering", fontsize=20)
sns.scatterplot(data=mall_data, x="Spending Score (1-100)", y="Annual Income (k$)", hue="Label_agg", s=60, palette=['green', 'orange', 'blue', 'red'], ax=axes[1, 1])
# DBSCAN
sns.scatterplot(x="Age", y="Annual Income (k$)", data=mall_data[mall_data["Label_DBS"]!=-1],
hue='Label_DBS', ax=axes[2, 0], palette=['green', 'orange', 'blue', 'red'], legend='full', s=60)
sns.scatterplot(x="Spending Score (1-100)", y="Annual Income (k$)", data=mall_data[mall_data["Label_DBS"]!=-1],
hue='Label_DBS', ax=axes[2, 1], palette=['green', 'orange', 'blue', 'red'], legend='full', s=60)
axes[2, 0].set_title("DBSCAN", fontsize=20)
axes[2, 1].set_title("DBSCAN", fontsize=20)
axes[2, 0].scatter(outliers['Age'], outliers['Annual Income (k$)'], s=5, label='outliers', c="k")
axes[2, 1].scatter(outliers['Spending Score (1-100)'], outliers['Annual Income (k$)'], s=5, label='outliers', c="k")
axes[2, 0].legend()
axes[2, 1].legend()
plt.setp(axes[2, 0].get_legend().get_texts(), fontsize='10')
plt.setp(axes[2, 1].get_legend().get_texts(), fontsize='10')
plt.show()
KMeans and hierarchical clustering produce almost the same clusters, apart from minor differences. DBSCAN, on the other hand, labels much of the data as outliers and fails to separate it well. Considering both computation time and clustering quality, hierarchical clustering appears to be the most suitable model for this data.
Finally, the characteristics of the class are summarized as follows.
| Cluster (Label) | Age | Annual Income | Spending Score |
|---|---|---|---|
| Cluster 0 | high | low | low |
| Cluster 1 | low | low | high |
| Cluster 2 | low | high | high |
| Cluster 3 | mid | high | low |
The conclusions of this project are summarized below.
As a future issue, it is necessary to dig deeper into the reasons why DBSCAN's classification was not successful, and to consider further metrics and hyperparameters.
[1] Mall Customers dataset, Kaggle: https://www.kaggle.com/datasets/shwetabh123/mall-customers.
[2] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proc. KDD, 1996, pp. 226–231.
[3] Project GitHub repository: https://github.com/DaisakuIkoma/CU_MSDS_USML.